Introduction

Bay Area Bike Share is a company that provides on-demand bike rentals for customers in San Francisco, Redwood City, Palo Alto, Mountain View, and San Jose. The bikes can be unlocked from one station and returned to any other station in the system, making them ideal for one-way trips. People use bike share to commute to work or school, run errands, get to appointments or social engagements and more. It’s a fun, convenient and affordable way to get around. Users pay for the service either through a yearly subscription or by purchasing 3-day or 24-hour passes. Users can make an unlimited number of trips, with trips under thirty minutes in length having no additional charge; longer trips will incur overtime fees.

Bay Area Bike Share - Diridon Station

Quick Peek

Let’s get a quick peek at the trips and stations dataset. The trips dataset contains information about trip id, duration of the trip, start station, end station, etc. The stations dataset gives us spatial information, dock count, etc. for each station. We merge the trips and stations dataset in order to map the latitude and longitude data to start and end stations in the trips dataset. Moreover, we create the columns start_month, start_hour, start_day, weekday, etc. to help us dig deeper in the dataset.

Trips

trips_dataset <- read.csv("data/trips_dataset.csv")
Table(trips_dataset,limit = 20,scroll = T)
trip_id duration_sec start_date start_station_name start_station_id end_date end_station_name end_station_id bike_number zip_code subscriber_type
944732 2618 2015-09-24 17:22:00 Mezes 83 2015-09-24 18:06:00 Mezes 83 653 94063 Customer
1030383 5780 2015-12-06 10:52:00 Mezes 83 2015-12-06 12:28:00 Mezes 83 44 94064 Customer
1102641 801 2016-02-23 12:25:00 Mezes 83 2016-02-23 12:39:00 Mezes 83 174 93292 Customer
1102613 1334 2016-02-23 12:00:00 San Mateo County Center 23 2016-02-23 12:22:00 Mezes 83 174 93292 Customer
1114916 1642 2016-03-03 21:54:00 Redwood City Public Library 24 2016-03-03 22:22:00 Mezes 83 208 94063 Customer
1114919 1377 2016-03-03 22:00:00 Redwood City Public Library 24 2016-03-03 22:23:00 Mezes 83 52 94063 Customer
1210164 232 2016-05-20 21:33:00 Redwood City Caltrain Station 22 2016-05-20 21:37:00 Mezes 83 685 94063 Customer
249715 1702 2014-04-15 20:17:00 Japantown 9 2014-04-15 20:45:00 Japantown 9 49 60305 Customer
871985 9699 2015-08-01 19:59:00 Japantown 9 2015-08-01 22:41:00 Japantown 9 488 95054 Customer
212536 872 2014-03-13 12:13:00 Japantown 9 2014-03-13 12:28:00 Japantown 9 672 95031 Customer
363634 1334 2014-07-14 23:40:00 Japantown 9 2014-07-15 00:02:00 Japantown 9 112 95131 Customer
252213 6320 2014-04-17 15:09:00 Japantown 9 2014-04-17 16:54:00 Japantown 9 659 94587 Customer
150636 545 2014-01-12 15:08:00 Japantown 9 2014-01-12 15:18:00 Japantown 9 80 95148 Customer
363633 1361 2014-07-14 23:39:00 Japantown 9 2014-07-15 00:02:00 Japantown 9 12 95121 Customer
163424 3813 2014-01-23 13:41:00 Japantown 9 2014-01-23 14:44:00 Japantown 9 59 95158 Customer
252212 6324 2014-04-17 15:08:00 Japantown 9 2014-04-17 16:54:00 Japantown 9 176 94587 Customer
1279192 320 2016-07-16 09:49:00 Japantown 9 2016-07-16 09:55:00 Japantown 9 64 95132 Customer
871977 10162 2015-08-01 19:51:00 Japantown 9 2015-08-01 22:40:00 Japantown 9 130 95051 Customer
158467 1392 2014-01-18 18:43:00 Japantown 9 2014-01-18 19:07:00 Japantown 9 159 94564 Customer
881594 2775 2015-08-08 16:37:00 Japantown 9 2015-08-08 17:23:00 Japantown 9 104 95136 Customer

Stations

stations_dataset <- read.csv("data/stations_dataset.csv")
Table(stations_dataset,limit = 20,scroll = T)
station_id name latitude longitude dockcount landmark installation_date
4 Santa Clara at Almaden 37.33399 -121.8949 11 San Jose 2013-08-06
84 Ryland Park 37.34273 -121.8956 15 San Jose 2014-04-09
8 San Salvador at 1st 37.33017 -121.8858 15 San Jose 2013-08-05
9 Japantown 37.34874 -121.8947 15 San Jose 2013-08-05
3 San Jose Civic Center 37.33070 -121.8890 15 San Jose 2013-08-05
13 St James Park 37.33930 -121.8899 15 San Jose 2013-08-06
10 San Jose City Hall 37.33739 -121.8870 15 San Jose 2013-08-06
16 SJSU - San Salvador at 9th 37.33396 -121.8773 15 San Jose 2013-08-07
7 Paseo de San Antonio 37.33380 -121.8869 15 San Jose 2013-08-07
6 San Pedro Square 37.33672 -121.8941 15 San Jose 2013-08-07
80 Santa Clara County Civic Center 37.35260 -121.9057 15 San Jose 2013-12-31
14 Arena Green / SAP Center 37.33269 -121.9001 19 San Jose 2013-08-05
5 Adobe on Almaden 37.33141 -121.8932 19 San Jose 2013-08-05
11 MLK Library 37.33588 -121.8857 19 San Jose 2013-08-06
12 SJSU 4th at San Carlos 37.33281 -121.8839 19 San Jose 2013-08-07
89 S. Market st at Park Ave 37.33240 -121.8904 19 San Jose 2016-06-05
88 5th S. at E. San Salvador St 37.33196 -121.8816 19 San Jose 2016-06-05
2 San Jose Diridon Caltrain Station 37.32973 -121.9018 27 San Jose 2013-08-06
37 Cowper at University 37.44860 -122.1595 11 Palo Alto 2013-08-14
35 University and Emerson 37.44452 -122.1631 11 Palo Alto 2013-08-15

Trips_Stations

## Get latitude-longitude and landmark details for start station and end station

trips_stations_dataset <- trips_dataset %>% left_join(stations_dataset,by = c("start_station_id"="station_id")) %>% mutate(start_station_latitude = latitude, start_station_longitude = longitude,start_station_landmark = landmark) %>% select(-c(latitude,longitude,dockcount,installation_date,name,landmark)) %>% left_join(stations_dataset,by = c("end_station_id"="station_id")) %>% mutate(end_station_latitude = latitude, end_station_longitude = longitude, end_station_landmark = landmark) %>% select(-c(latitude,longitude,dockcount,installation_date,name,landmark))

## Create start_hour, start_month, start_year, start_day, weekday

trips_stations_dataset <- trips_stations_dataset %>% mutate(start_year = year(start_date), start_month = month(start_date,label = T), start_day = day(start_date), start_hour = hour(start_date), weekday = wday(start_date,label = T))

Table(trips_stations_dataset,limit = 20,scroll = T)
trip_id duration_sec start_date start_station_name start_station_id end_date end_station_name end_station_id bike_number zip_code subscriber_type start_station_latitude start_station_longitude start_station_landmark end_station_latitude end_station_longitude end_station_landmark start_year start_month start_day start_hour weekday
944732 2618 2015-09-24 17:22:00 Mezes 83 2015-09-24 18:06:00 Mezes 83 653 94063 Customer 37.49127 -122.2362 Redwood City 37.49127 -122.2362 Redwood City 2015 Sep 24 17 Thu
1030383 5780 2015-12-06 10:52:00 Mezes 83 2015-12-06 12:28:00 Mezes 83 44 94064 Customer 37.49127 -122.2362 Redwood City 37.49127 -122.2362 Redwood City 2015 Dec 6 10 Sun
1102641 801 2016-02-23 12:25:00 Mezes 83 2016-02-23 12:39:00 Mezes 83 174 93292 Customer 37.49127 -122.2362 Redwood City 37.49127 -122.2362 Redwood City 2016 Feb 23 12 Tue
1102613 1334 2016-02-23 12:00:00 San Mateo County Center 23 2016-02-23 12:22:00 Mezes 83 174 93292 Customer 37.48762 -122.2300 Redwood City 37.49127 -122.2362 Redwood City 2016 Feb 23 12 Tue
1114916 1642 2016-03-03 21:54:00 Redwood City Public Library 24 2016-03-03 22:22:00 Mezes 83 208 94063 Customer 37.48422 -122.2274 Redwood City 37.49127 -122.2362 Redwood City 2016 Mar 3 21 Thu
1114919 1377 2016-03-03 22:00:00 Redwood City Public Library 24 2016-03-03 22:23:00 Mezes 83 52 94063 Customer 37.48422 -122.2274 Redwood City 37.49127 -122.2362 Redwood City 2016 Mar 3 22 Thu
1210164 232 2016-05-20 21:33:00 Redwood City Caltrain Station 22 2016-05-20 21:37:00 Mezes 83 685 94063 Customer 37.48608 -122.2321 Redwood City 37.49127 -122.2362 Redwood City 2016 May 20 21 Fri
249715 1702 2014-04-15 20:17:00 Japantown 9 2014-04-15 20:45:00 Japantown 9 49 60305 Customer 37.34874 -121.8947 San Jose 37.34874 -121.8947 San Jose 2014 Apr 15 20 Tue
871985 9699 2015-08-01 19:59:00 Japantown 9 2015-08-01 22:41:00 Japantown 9 488 95054 Customer 37.34874 -121.8947 San Jose 37.34874 -121.8947 San Jose 2015 Aug 1 19 Sat
212536 872 2014-03-13 12:13:00 Japantown 9 2014-03-13 12:28:00 Japantown 9 672 95031 Customer 37.34874 -121.8947 San Jose 37.34874 -121.8947 San Jose 2014 Mar 13 12 Thu
363634 1334 2014-07-14 23:40:00 Japantown 9 2014-07-15 00:02:00 Japantown 9 112 95131 Customer 37.34874 -121.8947 San Jose 37.34874 -121.8947 San Jose 2014 Jul 14 23 Mon
252213 6320 2014-04-17 15:09:00 Japantown 9 2014-04-17 16:54:00 Japantown 9 659 94587 Customer 37.34874 -121.8947 San Jose 37.34874 -121.8947 San Jose 2014 Apr 17 15 Thu
150636 545 2014-01-12 15:08:00 Japantown 9 2014-01-12 15:18:00 Japantown 9 80 95148 Customer 37.34874 -121.8947 San Jose 37.34874 -121.8947 San Jose 2014 Jan 12 15 Sun
363633 1361 2014-07-14 23:39:00 Japantown 9 2014-07-15 00:02:00 Japantown 9 12 95121 Customer 37.34874 -121.8947 San Jose 37.34874 -121.8947 San Jose 2014 Jul 14 23 Mon
163424 3813 2014-01-23 13:41:00 Japantown 9 2014-01-23 14:44:00 Japantown 9 59 95158 Customer 37.34874 -121.8947 San Jose 37.34874 -121.8947 San Jose 2014 Jan 23 13 Thu
252212 6324 2014-04-17 15:08:00 Japantown 9 2014-04-17 16:54:00 Japantown 9 176 94587 Customer 37.34874 -121.8947 San Jose 37.34874 -121.8947 San Jose 2014 Apr 17 15 Thu
1279192 320 2016-07-16 09:49:00 Japantown 9 2016-07-16 09:55:00 Japantown 9 64 95132 Customer 37.34874 -121.8947 San Jose 37.34874 -121.8947 San Jose 2016 Jul 16 9 Sat
871977 10162 2015-08-01 19:51:00 Japantown 9 2015-08-01 22:40:00 Japantown 9 130 95051 Customer 37.34874 -121.8947 San Jose 37.34874 -121.8947 San Jose 2015 Aug 1 19 Sat
158467 1392 2014-01-18 18:43:00 Japantown 9 2014-01-18 19:07:00 Japantown 9 159 94564 Customer 37.34874 -121.8947 San Jose 37.34874 -121.8947 San Jose 2014 Jan 18 18 Sat
881594 2775 2015-08-08 16:37:00 Japantown 9 2015-08-08 17:23:00 Japantown 9 104 95136 Customer 37.34874 -121.8947 San Jose 37.34874 -121.8947 San Jose 2015 Aug 8 16 Sat

Exploratory Data Analysis

Quick Facts

  • There are 84 stations in the dataset across 5 cities.
  • The dataset consists of trips from 29th August 2013 to 31st August 2016.
  • The total number of rides is 9,83,648.
  • The total duration of the rides in the dataset is 31 years 9 month 3 days 21 hours 21 minutes.
  • The average duration of each ride in the dataset is 17 minutes.

Trips by subscription

Let’s look at how the trips are divided by subscription type.

# Group by subscriber type and count number of trips.
trips_by_subscription <- trips_stations_dataset %>% group_by(subscriber_type) %>% summarise(total_trips = n())

plot_trips_by_subscription <- ggplot(data = trips_by_subscription) + geom_col(mapping = aes(x=subscriber_type,y=total_trips,fill= subscriber_type)) + xlab("Subscriber Type") + ylab("No of Trips") + ggtitle("Trips by Subscription") + geom_text(aes(x = subscriber_type,y = total_trips,label = total_trips),position = position_dodge(width = 0.9),vjust = -0.25) + scale_y_continuous(labels = scales::comma)  + guides(fill = guide_legend(title = "Subscriber Type"))

plot_trips_by_subscription

  • It can be seen in the plot above that the number of trips by subscribers are a lot more than the customers.

  • In the dataset, 86.1% of the rides are taken by the subsribers whereas only 13.9% of the rides are taken by the customers.

Trips by month

The bike sharing service is available throughout the year. Let’s look at the total numbers of rides at a monthly level and see if we can find any interesting pattern.

trips_by_month <- trips_stations_dataset %>% group_by(start_month,subscriber_type) %>% summarise(total_trips = n())

plot_trips_by_month <- ggplot(data = trips_by_month) + geom_col(mapping = aes(x=start_month,y=total_trips,fill = subscriber_type),position = position_dodge2()) + xlab("Month") + ylab("No of Trips") + ggtitle("Trips by Month") + scale_y_continuous(labels = scales::comma)  + guides(fill = guide_legend(title = "Subscriber Type"))

plot_trips_by_month

  • More number of trips are taken by subscribers from March to October as compared to November to February. The dip in December must be due to less usage of bikes as users are on vacation.

  • The weather from Nov-Feb is relatively cold. Some of the users must be using alternative mode of transporation in cold weather.

  • Similary, customers have taken more trips from March to October than from November to February.

Trips by Weekday

Now we will group the rides by weekday and analyze the usage pattern of subsribers and consumers over weekday as well as weekend.

trips_by_weekday <- trips_stations_dataset %>% group_by(weekday,subscriber_type) %>% summarise(total_trips = n())

plot_trips_by_weekday <- ggplot(data = trips_by_weekday) + geom_col(mapping = aes(x=weekday,y=total_trips,fill = subscriber_type),position = position_dodge2())  + xlab("Weekday") + ylab("No of Trips") + ggtitle("Trips by Weekday") + scale_y_continuous(labels = scales::comma)  + guides(fill = guide_legend(title = "Subscriber Type"))

plot_trips_by_weekday 

  • The subscribers show a high usage pattern on weekdays. This must be due to the subsribers using the bikes for day-to-day commute, going for lunch,dinner, etc.

  • The customers show low usage over weekdays . However, over the weekends, the number of trips by the customers is almost equal to subscribers. This must be due to more customers using bikes over weekend for leisure purpose.

Trips by Hour

Now we will group the trips by hour and observe the hourly usage of subcribers and consumers.

trips_by_hour <- trips_stations_dataset %>% group_by(start_hour,subscriber_type) %>% summarise(total_trips = n())

plot_trips_by_hour <- ggplot(data = trips_by_hour) + geom_col(mapping = aes(x=start_hour,y=total_trips,fill = subscriber_type),position = position_dodge2()) + xlab("Hour") + ylab("No of Trips") + ggtitle("Trips by Hour") + scale_y_continuous(labels = scales::comma)  + guides(fill = guide_legend(title = "Subscriber Type"))

plot_trips_by_hour

  • The subscribers hourly usage follows a bimodal distributions. It can be seen that the usage peaks at 8AM and 5PM. This must be due to the subsribers riding the bike to work in the morning and returning home in the evening.

  • The customers hourly usage follows a normal distribution. The customers show high usage from 12PM to 4 PM.

Trips by Hour and Weekday

Let’s analyze how the ride usage changes by hour and weekday. We will group the trips by hour and weekday and overlay the number of trips as heatmap.

trips_hour_weekday <-trips_stations_dataset %>% group_by(weekday,start_hour,subscriber_type) %>% summarise(total_trips = n()) %>% ungroup() %>% mutate( subscriber_type = factor( subscriber_type))

plot_trips_hour_weekday <- trips_hour_weekday %>% ggplot(aes(x = start_hour,y = weekday,fill = total_trips)) + geom_tile(color="white",size = 0.1) + coord_equal() + facet_wrap( ~subscriber_type, ncol = 1 ) + scale_fill_viridis(name = "Total Trips", label = comma)+
 labs(x = "Hour", y = "Weekday", title = "Trips by Hour and Weekday") + theme_tufte(base_family = "Helvetica") + theme( axis.ticks = element_blank(),
       plot.title = element_text(hjust = 0.5),
       legend.title = element_text(size = 8),
       legend.text = element_text(size = 6) )

plot_trips_hour_weekday

  • In the above heatmap for subscribers, we can see that there is high usage on weekdays from 7-9 AM and 4-6PM. This must be due to subscribers using the bike for work commute.

  • We can also see moderate usage by subscribers at 12PM. This must be due to users going for lunch on bikes.

  • Customers tend to use bikes more on weekends between 11AM to 4PM. They must be using the bike for leisure purpose.

Trips by duration

As stated above, the average duration of the trip is 17 minutes. Let’s observe the usage duration distribution for customers and subscribers.

trips_by_duration <- trips_stations_dataset %>% mutate(duration_minutes = ifelse((duration_sec/60.0)>60,60,duration_sec/60.0))

plot_trips_by_duration_subscriber <- trips_by_duration %>% filter(subscriber_type=="Subscriber") %>% ggplot()  + geom_histogram(mapping = aes(duration_minutes),fill = "blue",bins = 50) + scale_y_continuous(labels = scales::comma) + labs(x = "Duration (minutes)", y = "No of Trips") 

plot_trips_by_duration_customer <- trips_by_duration %>% filter(subscriber_type=="Customer") %>% ggplot()  + geom_histogram(mapping = aes(duration_minutes),fill = "blue",bins = 50) + scale_y_continuous(labels = scales::comma) + labs(x = "Duration (minutes)", y = "No of Trips") 

cowplot::plot_grid(plot_trips_by_duration_customer,plot_trips_by_duration_subscriber,labels = c("Customer","Subscriber"),label_size = 12,hjust = -1.5)

  • The usage duration distribution for riders for duration less than 60 minutes peaks around 8-10 minutes.

  • However, most of the trips taken by the customers have duration more than 60 minutes. This must be due to customers hiring the bike for shopping, or leisue ride around the city,

  • Most of trips taken by the subscriber last less than half an hour. This must be due to subscribers using the bike for specific purpose like going to office, market, etc.

Heatmap

Now we will analyze the density heatmap of origin and destination stations. This will help us visualize popular bike pick up and drop stations.

Origin Station Heatmap

First, let us look at the heatmap of origin station.

plot_trips_start_by_station <- leaflet(trips_start_by_station) %>% addProviderTiles(providers$OpenStreetMap) %>% addScaleBar() %>% addHeatmap(lat = ~start_station_latitude, lng = ~start_station_longitude, intensity = ~total_trips_normalized, minOpacity = 0.1, max = 0.7)

plot_trips_start_by_station
  • The heatmap above shows that more number of trips start in San Fransisco followed by San Jose, Mountain View, Palo Alto, Redwood City.

  • Zooming in further into the heatmap shows San Francisco Caltrain (Townsend at 4th) and San Francisco Caltrain 2 (330 Townsend) as the most popular starting stations. Both the stations are outside the Caltrain station. Hence, a lot of users getting out of the station must be using it.

Destination Station Heatmap

Now let us look at the heatmap for destination stations.

plot_trips_end_by_station <- leaflet(trips_end_by_station) %>% addProviderTiles(providers$OpenStreetMap) %>% addScaleBar() %>% addHeatmap(lat = ~end_station_latitude, lng = ~end_station_longitude, intensity = ~total_trips_normalized, minOpacity = 0.1, max = 0.7)

plot_trips_end_by_station
  • The popular destination stations are also located in San Fransico as seen in the heatmap above.

  • Here also, the popular destination stations are San Francisco Caltrain (Townsend at 4th) and San Francisco Caltrain 2 (330 Townsend). This must be due to riders parking the bikes here before boarding the train.

Bike Inflow/Outflow per station

Now we analyze the inflow and outflow of bikes per station. This will help us to address the bike distribution problem.

The dots with red hue shows the stations which have more bike pickups than drops.The dots with yello hue shows almost equal pickup and drops whereas the green dots show low bike pickups than drops. Clicking on the station gives us info about the station name, number of trips started and no of trips ended at the station.

trips_start_end_total <-
  inner_join(
    trips_start_by_station,
    trips_end_by_station,
    by = c("start_station_name" = "end_station_name"),
    suffix = c("_started_here", "_ended_here")
  ) %>% dplyr::select(-c(end_station_latitude, end_station_longitude)) %>%  mutate(diff_start_end = total_trips_ended_here - total_trips_started_here) %>% mutate(content = paste(
    paste("Station Name: ", start_station_name),
    paste("Trips Started here: ", total_trips_started_here),
    paste("Trips Ended here: ", total_trips_ended_here),
    sep = "<br/>"
  ))

palData <- classInt::classIntervals(trips_start_end_total$diff_start_end,style = "quantile")
colPalette <- RColorBrewer::brewer.pal(n = 9, name = "RdYlGn")

trips_start_end_total <- trips_start_end_total %>% ungroup %>% mutate(colors = classInt::findColours(palData,colPalette))

plot_trips_start_end_total <- leaflet(trips_start_end_total) %>% addProviderTiles(providers$CartoDB.DarkMatter) %>% setView(lng = -122.419416,lat = 37.774929,zoom = 12) %>% addCircles(lng = ~start_station_longitude,lat = ~start_station_latitude,opacity = 1,fillOpacity = 1,color  = ~colors,stroke = T,popup = ~content,radius = 25)

plot_trips_start_end_total
  • It can be seen that the stations such as Grant Avenue at Columbus Avenue and 2nd at Folsom have significantly more trips starting. Whereas the stations such as San Francisco Caltrain (Townsend at 4th) and Market at Sansome have more trips ending there.

  • This can help us develop the system to transport the bikes from low traffic stations to high traffic stations.

Recommendations

  • Efficient bike transportation system should be put in place to transport bikes from low traffic stations to high traffic stations.

  • A model can be built to forecast bike demands at each station. This will help us optimize the bike deployment system and hence increase the revenue.

  • Discounts should be offered during weekends and non-peak hours. This will help increase the bike utilization and increase the revenue.

  • Marketing promotions should be targetted on regular customers to convert them into subscribers.

  • Surge pricing should be implemented based on the outflow and inflow at each station.